Boston: A Cold City

Jin Shutima Han, Wanqi Peng, Matt Dailis

Introduction

Boston is a cold city, and its weather affects life there in many ways. In this project, we focus on the relationship between weather and 311 reports.

Our data comes from four sources: Boston Weather, the 311 dataset, Median Household Income, and Boston population by neighborhood. We extracted the useful information from each and combined them into a single final dataset. Based on that dataset, we came up with several questions to investigate: How many snow-related 311 requests are there? What types of 311 requests occur in extreme weather? How do 311 reports vary by region and median household income, especially in extreme weather conditions?

Abstract

By comparing how often the word “snow” appears in 311 request titles on days with and without snowfall, we find results consistent with our expectations, and they show how seriously snowfall affects Bostonians. For request types in extreme weather, we focus on three typical conditions of winter Boston: snowy, windy, and chilly. By generating word clouds from the reports, we can see what kinds of problems Bostonians encounter during severe winter weather. Going deeper, correlating these three weather events with each region tells us more specifically what 311 services each region demands and what troubles its residents face. Regions of Boston also vary by income, so we examine the relationship between income and the 311 reports residents of each region make, in order to characterize the requests that wealthier and poorer Bostonians make in snowy and chilly weather.

Datasets

Boston Weather (source)

After sampling several weather datasets (and creating more temporary accounts and signing up for more free trials than we can count), we settled on this dataset posted to kaggle.com (a data science competition website). It suits our purposes because it provides day-by-day information on what the weather was like in Boston over the past several years.

311 Dataset (source)

We use the standard 311 dataset from data.boston.gov.

Median Household Income (source)

We use census data from boston.gov. We entered this part manually, since Boston has only a small number of neighborhoods.

Boston population by neighborhood (source)

When we realized we needed to know the population of each neighborhood in Boston, we pulled that information from here.

Cleaning up our Data

Before we can ask any questions of our data, we need to put it in a form conducive to analysis. We start by loading each dataset and cleaning up the columns that will be used for merging with other datasets. We also do any computations that are most easily done pre-merging.

In [1]:
# All imports go in this block
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from re import sub
from decimal import Decimal
from wordcloud import WordCloud, STOPWORDS
from scipy import stats

Weather Data

In [2]:
# Load the weather dataset
df_weather = pd.read_csv("data/Boston weather_clean.csv")

For the purposes of answering questions about snow in Boston, it would be helpful to know a few things about whether snow is likely to be on the ground on any given day.

consecutive_snow_days

First of all, it would be helpful to label each day by whether it was, say, the first day of snowfall or the third day of a long snow storm. For this reason, we add the consecutive_snow_days column.

days_since_last_snow

Next, for those days on which no snow was reported, it would be nice to know if there had recently been snowfall, which would cause snow to still affect 311 requests on that day. For this reason we added the days_since_last_snow column to the dataset.

accumulated_snow

Lastly, in case we want to look at accumulated snow over several days, we keep track of an accumulated_snow column, which adds up the snowfall in inches for consecutive snow days.

In [3]:
# Compute snow-related metrics
consecutive_snow_days = [0]
days_since_last_snow = [365]
accumulated_snow = [0]
for index, row in df_weather.iterrows():
    if row['Events'] == 'Snow' or row['Events'] == 'Both' or row['Snowfall (in)'] > 0:
        if consecutive_snow_days[-1] == 0:
            accumulated_snow.append(float(row['Snowfall (in)']))        
        else:
            accumulated_snow.append(accumulated_snow[-1] + row['Snowfall (in)'])
        consecutive_snow_days.append(consecutive_snow_days[-1] + 1)
        days_since_last_snow.append(0)
    else:
        accumulated_snow.append(accumulated_snow[-1])
        consecutive_snow_days.append(0)
        days_since_last_snow.append(days_since_last_snow[-1] + 1)
df_weather['consecutive_snow_days'] = consecutive_snow_days[1:]
df_weather['days_since_last_snow'] = days_since_last_snow[1:]
df_weather['accumulated_snow'] = accumulated_snow[1:]

311 Data

We downloaded the 311 dataset from data.boston.gov. In order to merge it with the weather dataset, we split the open date of each request into three columns (Year, Month, and Day). We chose the open date rather than the close date because we figured the weather most relevant to a request would be the weather at the time of opening, and not when it was closed.

In [4]:
# Load 311 dataset
df_311 = pd.read_csv('data/311.csv')
In [5]:
# Split out the date column into separate Year Month and Day columns
df_311['open_dt'] = pd.to_datetime(df_311['open_dt'])
df_311['Year'] = df_311['open_dt'].apply(lambda date: date.year)
df_311['Month'] = df_311['open_dt'].apply(lambda date: date.month)
df_311['Day'] = df_311['open_dt'].apply(lambda date: date.day)
In [6]:
# Merge the datasets. This performs an INNER JOIN which keeps only items
# that are present in both datasets.
df_311_weather = pd.merge(df_311, df_weather, on=['Year', 'Month', 'Day'])

Income Data

We found the median household income for each region of Boston, and decided to include it in our dataset to try to see if we can find some interesting correlations between income and 311 requests.

In [7]:
df_income = pd.read_csv('data/median income.csv', delimiter=";")
In [8]:
def convert_neighborhood(neighborhood):
    """
    Match neighborhoods in the 311 dataset with regions in the income dataset
    """
    conversions = [("Allston / Brighton", "Allston/Brighton"),
                   ("Allston", "Allston/Brighton"),
                   ("Brighton", "Allston/Brighton"),
                   ("Back Bay", "Back Bay/Beacon Hill"),
                   ("Beacon Hill", "Back Bay/Beacon Hill"),
                   ("Fenway / Kenmore / Audubon Circle / Longwood", "Fenway/Kenmore"),
                   ("Greater Mattapan", "Mattapan"),
                   ("Downtown / Financial District", "Boston"),
                   ("Mission Hill", "West Roxbury"),
                   ("South Boston / South Boston Waterfront", "South Boston"),
                   ("Chestnut Hill", "Allston/Brighton")]
    for left, right in conversions:
        if neighborhood == left:
            return right
    return neighborhood
In [9]:
df_311_weather['neighborhood'] = df_311_weather['neighborhood'].apply(convert_neighborhood)
In [10]:
# Clean up dollar strings into numbers
df_income['median household income'] = df_income['median household income'].apply(lambda money: Decimal(sub(r'[^\d.]', '', str(money))))
In [11]:
df_income
Out[11]:
region median household income ($) population (thousands)
0 Boston 52433 31.82
1 Hyde Park 53474 31.85
2 Charlestown 83926 16.44
3 East Boston 43511 40.51
4 Roxbury 30654 52.53
5 South End 51870 30.36
6 Back Bay/Beacon Hill 82742 31.82
7 Fenway/Kenmore 32509 38.38
8 Allston/Brighton 52362 74.80
9 Mattapan 42164 34.39
10 Roslindale 62702 27.62
11 West Roxbury 71066 30.44
12 Dorchester 45807 88.33
13 South Boston 63747 33.69
14 Central Boston 65662 1.98
15 Jamaica Plain 55861 41.26
In [12]:
df_311_weather_income = pd.merge(df_311_weather, df_income, left_on="neighborhood", right_on="region", how="left")

Our final combined dataset:

After merging all of our data sources into one dataset, we get a table in which each row is a 311 request, and the columns include information about that 311 request, what the weather was that day, and the median household income of the region in which the request was made. Below is a sample of this dataset, showing only the columns that are of interest to us. (Scroll right to see the new columns)

In [13]:
df = df_311_weather_income.filter(['Year', 'Month', 'Day', 'case_title', 'reason', 'type', 'neighborhood', 'High Temp (F)', 'Avg Temp (F)', 'Low Temp (F)', 'High Wind (mph)', 'Avg Wind (mph)', 'High Wind Gust (mph)', 'Snowfall (in)', 'Precip (in)', 'Events', 'consecutive_snow_days', 'days_since_last_snow', 'accumulated_snow', 'median household income', 'population'])
df
Out[13]:
Year Month Day case_title reason type neighborhood High Temp (F) Avg Temp (F) Low Temp (F) ... Avg Wind (mph) High Wind Gust (mph) Snowfall (in) Precip (in) Events consecutive_snow_days days_since_last_snow accumulated_snow median household income population
0 2011 7 1 Street Light Outages Street Lights Street Light Outages Mattapan 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 42164 34.39
1 2011 7 1 Schedule a Bulk Item Pickup Sanitation Schedule a Bulk Item Pickup Roslindale 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 62702 27.62
2 2011 7 1 Street Light Outages Street Lights Street Light Outages Roxbury 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 30654 52.53
3 2011 7 1 Highway Maintenance Highway Maintenance Highway Maintenance Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
4 2011 7 1 Sticker Request Recycling Sticker Request South Boston 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 63747 33.69
5 2011 7 1 Schedule a Bulk Item Pickup Sanitation Schedule a Bulk Item Pickup Roxbury 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 30654 52.53
6 2011 7 1 Schedule a Bulk Item Pickup Sanitation Schedule a Bulk Item Pickup West Roxbury 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 71066 30.44
7 2011 7 1 Schedule a Bulk Item Pickup Sanitation Schedule a Bulk Item Pickup Allston/Brighton 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 52362 74.80
8 2011 7 1 Schedule a Bulk Item Pickup Sanitation Schedule a Bulk Item Pickup Allston/Brighton 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 52362 74.80
9 2011 7 1 Graffiti: Ward 10 1009 Other Graffiti Graffiti Removal Jamaica Plain 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 55861 41.26
10 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
11 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
12 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
13 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
14 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
15 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
16 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
17 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
18 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
19 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
20 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
21 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
22 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
23 2011 7 1 Sidewalk Repair (Internal) Highway Maintenance Sidewalk Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
24 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
25 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
26 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Dorchester 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 45807 88.33
27 2011 7 1 Highway Maintenance Highway Maintenance Highway Maintenance East Boston 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 43511 40.51
28 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Hyde Park 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 53474 31.85
29 2011 7 1 Pothole Repair (Internal) Highway Maintenance Pothole Repair (Internal) Hyde Park 78 72 66 ... 7 18 0.0 0.00 None 0 7 0.01 53474 31.85
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1228622 2018 4 7 Poor Conditions of Property Code Enforcement Poor Conditions of Property South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228623 2018 4 7 PRINTED : JK Street Lights Street Light Outages Roxbury 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 30654 52.53
1228624 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement West Roxbury 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 71066 30.44
1228625 2018 4 7 Improper Storage of Trash (Barrels) Code Enforcement Improper Storage of Trash (Barrels) South End 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 51870 30.36
1228626 2018 4 7 Improper Storage of Trash (Barrels) Code Enforcement Improper Storage of Trash (Barrels) Jamaica Plain 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 55861 41.26
1228627 2018 4 7 Request for Pothole Repair Highway Maintenance Request for Pothole Repair Mattapan 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 42164 34.39
1228628 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Back Bay/Beacon Hill 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 82742 31.82
1228629 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 52433 31.82
1228630 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Back Bay/Beacon Hill 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 82742 31.82
1228631 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South End 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 51870 30.36
1228632 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 52433 31.82
1228633 2018 4 7 PWD Graffiti Highway Maintenance PWD Graffiti Fenway/Kenmore 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 32509 38.38
1228634 2018 4 7 Request for Pothole Repair Highway Maintenance Request for Pothole Repair Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 52433 31.82
1228635 2018 4 7 Request for Pothole Repair Highway Maintenance Request for Pothole Repair Jamaica Plain 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 55861 41.26
1228636 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228637 2018 4 7 Request for Pothole Repair Highway Maintenance Request for Pothole Repair South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228638 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228639 2018 4 7 Schedule Bulk Item Pickup Sanitation Schedule a Bulk Item Pickup SS Mattapan 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 42164 34.39
1228640 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228641 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 52433 31.82
1228642 2018 4 7 Animal Generic Request Animal Issues Animal Generic Request Mattapan 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 42164 34.39
1228643 2018 4 7 Request for Pothole Repair Highway Maintenance Request for Pothole Repair Mattapan 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 42164 34.39
1228644 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Dorchester 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 45807 88.33
1228645 2018 4 7 Rodent Activity Environmental Services Rodent Activity Back Bay/Beacon Hill 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 82742 31.82
1228646 2018 4 7 Pick up Dead Animal Street Cleaning Pick up Dead Animal Mattapan 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 42164 34.39
1228647 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228648 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Dorchester 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 45807 88.33
1228649 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 63747 33.69
1228650 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement South End 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 51870 30.36
1228651 2018 4 7 Parking Enforcement Enforcement & Abandoned Vehicles Parking Enforcement Boston 47 41 35 ... 12 27 0.0 0.01 Rain 0 1 0.21 52433 31.82

1228652 rows × 21 columns

Counting Snow-Related 311 Requests

In [14]:
# Make sure plt is in a clean state
plt.rcdefaults()

# changes the type of the values under the column “case_title” into a string
df["case_title"] = df["case_title"].astype(str)

# function that generates graphs according to which dataset has been passed in
def graph_snow_requests_vs_actual_requests(df):
    
    # lists all the data in which there was snowfall
    # by finding all the values under the “snowfall (in)” column that
    # has a value greater than 0.
    snowed = df["Snowfall (in)"] > 0
    snow_data = df[snowed]

    # filters dataset to give data in which the case title of the 311 report
    # contains the string "Snow" (note: the match is case-sensitive)
    has_snow = df["case_title"].str.contains("Snow")
    reason_snow = df[has_snow]
    
    # lists all the data in which there wasn't snowfall
    # by finding all the values under the “snowfall (in)” column that
    # has a value of exactly 0.
    no_snow = df["Snowfall (in)"] == 0

    # code to get dataset in which there wasn't snow but still had report(s) on snow
    # by filtering dataset by the conditions in which it didn’t snow, yet has a 
    # 311 report on that day in which the report title had the word “snow” in it. 
    no_snow_data = df[no_snow & has_snow]
    
    # counts the number of instances in which there was no snow, yet there was
    # at least one 311 report that had the word “snow” in its title
    count_no = no_snow_data['case_title'].count()
    
    # data has been filtered to show only those in which there was snow, and also
    # 311 reports with the word “snow” in its title
    yes_snow_data = df[snowed & has_snow]
    
    # counts the number of instances in which there was snow, and also
    # at least one 311 report that had the word “snow” in its title
    count_yes = yes_snow_data['case_title'].count()

    # code gets all the data in which the case title of the 311 report does
    # NOT have the word "Snow" in it (the complement of has_snow)
    no_word = ~has_snow
    reason_nosnow = df[no_word]
    
    # filters dataset to get all the data in which there was snowfall, but no 311
    # report with the case title having the word “snow” in it
    yesSnow_noReport = df[snowed & no_word]
    
    # counts the number of instances in which there was snowfall, but no
    # 311 report that had the word “snow” in its case title
    count_noSnowReport = yesSnow_noReport['case_title'].count()
    
    # filters dataset to get all the data in which there was no snowfall,and no 311
    # reports with the case title having the word “snow” in it
    noSnow_noReport = df[no_snow & no_word]
    
    # counts the number of instances in which there was no snowfall, and also no
    # 311 reports with the word “snow” in its case title
    count_noSnownoReport = noSnow_noReport['case_title'].count()

    # First graph
    # Code to draw a bar graph where we can see the number of 311 reports that 
    # either has the word “snow” in its case title or not when there was snowfall.
    # The values of the bars are from the count of instances in which there was 
    # snowfall but no 311 reports that had the word “snow” in the case title, and 
    # another from when there was snowfall and also had 311 reports that had a case 
    # title with the word “snow” in it.
    x_pos = ('"snow" exists', '"snow" doesn\'t exist')
    y_pos = np.arange(len(x_pos))
    performance = [count_yes,count_noSnowReport]

    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, x_pos)
    plt.ylabel('Number of 311 requests')
    plt.xlabel('Whether the word "snow" exists within the 311 report title')
    plt.title('Having the word "snow" in 311 requests when there WAS snowfall')

    plt.show()
    
    # Second graph
    # Code to draw the bar graph in which we can see the number of 311 reports that 
    # either has the word “snow” in its case title or not when there was NO snowfall.
    # The values of the bars are from the count of instances in which there was no
    # snowfall, and also no 311 reports that has the word “snow” in its case title, and
    # another count from when there was no snowfall, yet there were 311 reports in which
    # the case title had the word “snow” in its case title
    x_pos = ('"snow" exists', '"snow" doesn\'t exist')
    y_pos = np.arange(len(x_pos))
    performance = [count_no,count_noSnownoReport]

    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, x_pos)
    plt.ylabel('Number of 311 requests')
    plt.xlabel('Whether the word "snow" exists within the 311 report title')
    plt.title('Having the word "snow" in 311 requests when there WAS NO snowfall')

    plt.show()
In [15]:
graph_snow_requests_vs_actual_requests(df)

Note: This dataset is from 2011 to 2018, inclusive of all seasons

Interpretation of FIRST bar graph above:

There are more reports in which the word “snow” doesn’t appear in the case title even when there is snow, and there are several reasons why that might be the case. First of all, “snowfall” in our dataset is any day with more than 0 inches of snow, and an inch or so of snow doesn’t really hurt the community much. Also, not all snow-related reports include the word “snow” in the title: a report can be titled something like “frozen road” or “request for street cleaning,” and such reports are omitted from our count because we only look at titles that explicitly include the word “snow.”
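
As a sanity check on this limitation, a broader, case-insensitive keyword match would catch some of the titles our strict "Snow" match misses. This is only a sketch: the keyword list below is hypothetical, and the sample titles are toy examples rather than rows from our dataset.

```python
import pandas as pd

# Hypothetical broader keyword list; the analysis above matches only "Snow".
SNOW_KEYWORDS = ["snow", "ice", "icy", "frozen", "plow"]

def is_snow_related(title):
    """Case-insensitive check for any snow-related keyword in a case title."""
    title = str(title).lower()
    return any(word in title for word in SNOW_KEYWORDS)

# Toy sample of case titles to illustrate the difference.
titles = pd.Series(["Snow Plowing Request", "Frozen road", "Pothole Repair"])
print(titles.str.contains("Snow").sum())    # strict match catches 1 title
print(titles.apply(is_snow_related).sum())  # broader match catches 2
```

A broader match like this would raise the “snow-related” counts in both bar graphs, though at the risk of false positives (e.g. “ice” matching unrelated words).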

Interpretation of SECOND bar graph above:

Looking at the graph, there are more reports in which the word “snow” doesn’t appear in the case title when there is no snow. The most obvious reason there are more reports explicitly unrelated to snow is that this dataset includes all seasons from 2011 to 2018. Boston gets pretty snowy during the winter months, but snow from spring to fall is rare, and because our dataset includes all seasons, we naturally have many more reports that are unrelated to snowfall.

In [16]:
winter_2014_2015 = ((df['Year'] == 2014) & (df['Month'] > 8)) | ((df['Year'] == 2015) & (df['Month'] < 5))
graph_snow_requests_vs_actual_requests(df[winter_2014_2015])

Note: This dataset is from 2014 to 2015, only including the winter months

Interpretation of FIRST bar graph:

Unlike the bar graph above for the same question, with this dataset restricted to the winter months we now have more 311 requests with the word “snow” in the case title. We picked the winter of 2014 to 2015 specifically because it brought a lot of snowfall, and probably for that reason it also produced a lot more snow-related 311 requests. As stated above, this count omits requests that do not explicitly have the word “snow” in the case title, but even so, reports with “snow” in the title still dominate. This tells us that the winter of 2014-2015 was indeed quite severe in terms of how much the weather negatively affected the residents of Boston.

Interpretation of SECOND bar graph:

Similar to the bar graph above for the same question, even the winter-only dataset has more 311 requests without the word “snow” in the case title when there was no snowfall. This is probably because, with no snow that day, there were fewer problems, or at least fewer severe problems, related to snow. The snow-related reports that remain are probably due to remnants of snow from snowy weather a day or two earlier.

311 Request Types in Extreme Weather

Next, we investigate a few types of extreme weather (windy, snowy, and chilly) and how they may affect the kinds of 311 requests that come in. It is interesting to see which sorts of requests occur more often in certain types of weather than in others.

One visually effective way to observe trends in word choice is by generating word clouds. Below, we define which requests fit our criteria for each weather type, and then we generate wordclouds from the request text for those requests.

In order to help define a "snowy" day, we use the days_since_last_snow metric we came up with in the datasets section of this project. We do this because we want to include the day after a snowfall: we expect snow removal requests to continue into the next day. While many snow requests happen exactly on the day it snowed, some happen in the following days, as demonstrated in the calculation below.

In [17]:
# What is the average days_since_last_snow for snow removal requests
# 1. filter the table for only entries about snow removal
plow_requests = df['case_title'].str.contains("Snow", na=False)
# 2. for those entries, compute the average of days_since_last_snow
print(df[plow_requests & winter_2014_2015]['days_since_last_snow'].mean())
print(df[plow_requests]['days_since_last_snow'].mean())
0.281981759921124
0.41537880548454564
In [18]:
## WINDY ##
windy = df['Avg Wind (mph)'] > 30

## SNOWY ##
snowy = df['days_since_last_snow'] < 2

## CHILLY ##
celsius = df['Low Temp (F)'].apply(lambda x: (x - 32) * (5.0/9.0))
chilly = celsius < (-19)
In [19]:
def make_wordcloud_from_text(text):
    wordcloud = WordCloud(
        width = 3000,
        height = 2000,
        background_color = 'white',
        stopwords = STOPWORDS).generate(text)
    fig = plt.figure(
        figsize = (20, 15),
        facecolor = 'white',
        edgecolor = 'white')
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
    
def make_wordcloud(df):
    text_cols = ['reason', 'type']
    make_wordcloud_from_text(" ".join(df[col_name].str.cat(sep=" ") for col_name in text_cols))

Snowy

The first type of extreme weather that we showcase is the snowy case. We define this to be days on which it snowed, as well as the day immediately following a snow day. The biggest words are Snow Plowing Request, which fits our expectations. There are also requests to repair potholes and reports of missed trash pickups. There are many requests to Schedule a Bulk Item Pickup, which baffles us a little bit - perhaps these are reports of missed pickups.

An interesting snow-related request is "Parking Enforcement" - it would be interesting to know how many of these are people who had marked their spot with lawn chairs. More likely, since snow makes many previous parking spaces unviable, people are more likely to park illegally and thus trigger more parking enforcement-related 311 calls.
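
One way to probe this hypothesis would be to compare the share of Parking Enforcement requests on snowy versus non-snowy days. This is a sketch on a toy stand-in table (the column names match our merged dataset, but the rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the merged dataset; the real df has the same columns.
toy = pd.DataFrame({
    "type": ["Parking Enforcement", "Parking Enforcement", "Sanitation",
             "Parking Enforcement", "Sanitation", "Sanitation"],
    "days_since_last_snow": [0, 1, 0, 5, 6, 7],
})

snowy = toy["days_since_last_snow"] < 2  # same "snowy" definition as above

# Share of requests that are Parking Enforcement, snowy vs. non-snowy days
share_snowy = (toy[snowy]["type"] == "Parking Enforcement").mean()
share_other = (toy[~snowy]["type"] == "Parking Enforcement").mean()
print(share_snowy, share_other)
```

If the snowy-day share came out meaningfully higher on the real data, that would support the illegal-parking explanation.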

In [20]:
make_wordcloud(df[snowy])

Windy

In the wordcloud below, observe the words that come up most often on windy days (days with an average wind speed of 30 mph or higher). Tree-related requests, as well as Downed Wire reports, are very common. Street lights, traffic lights, signal repair: all seem to show that Boston's traffic flow control infrastructure is vulnerable to high winds. There are also many requests for street cleaning.

In [21]:
make_wordcloud(df[windy])

Chilly

Snow is only one of the challenges Bostonians face in the winter. The other one is just sheer cold. What kinds of calls do Bostonians make in subfreezing temperatures?

The biggest change from the snowy wordcloud is the phrase "Heat Excessive Insufficient", which seems to be a catch-all type for heating-related issues. When it is chilly, people are more likely to complain about their heating systems.
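
This shift can also be checked numerically: compare the share of heating-related requests on chilly days against the overall share. The sketch below uses toy rows (the column names follow our merged dataset, and the Fahrenheit-to-Celsius cutoff is the same one we defined above):

```python
import pandas as pd

# Toy stand-in rows; the real analysis would use df directly.
toy = pd.DataFrame({
    "type": ["Heat - Excessive  Insufficient", "Snow Plowing",
             "Heat - Excessive  Insufficient", "Pothole Repair"],
    "Low Temp (F)": [-5, 10, -3, 40],
})

# Same chilly definition as above: low temperature below -19 Celsius
chilly = toy["Low Temp (F)"].apply(lambda f: (f - 32) * 5.0 / 9.0) < -19

heat = toy["type"].str.contains("Heat")
print(heat[chilly].mean())  # share of heating requests on chilly days -> 1.0
print(heat.mean())          # overall share of heating requests -> 0.5
```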

In [22]:
make_wordcloud(df[chilly])

Outlier: Valentine's Day 2016

Our first definition of chilly looked only at temperatures below negative 20 degrees Celsius. This turned out to be too aggressive a cutoff, since it left us with only one day on record: February 14th, 2016. On this day, there were several requests for "Heat Excessive Insufficient" from Charlestown. While it made a nice (and different-looking) wordcloud, we soon realized that this was merely an artifact of having very little data. By moving the threshold up to negative 19 degrees Celsius, we included a couple more years' worth of information (but still in the teens of February, which seems to be the coldest time in Boston).
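
The threshold check described above boils down to counting how many days fall below each cutoff. A minimal sketch, with made-up daily lows standing in for the weather table:

```python
import pandas as pd

# Toy daily lows; the real analysis used the weather table's Low Temp (F).
days = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2018],
    "Month": [2, 2, 2, 2, 1],
    "Day": [14, 15, 14, 15, 7],
    "Low Temp (F)": [-3, 2, -9, -2, 1],
})

celsius = (days["Low Temp (F)"] - 32) * 5.0 / 9.0
for cutoff in (-20, -19):
    # Count days colder than each candidate Celsius cutoff
    print(cutoff, (celsius < cutoff).sum())
```

In this toy data, the -20 cutoff keeps only one day while -19 keeps two, illustrating how a one-degree change in threshold can multiply the amount of data available.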

311 Reports by Region

Here, we correlate the frequency of 311 reports with each region in Boston by counting the occurrences of each neighborhood in the "neighborhood" column of the 311 dataset.

In [23]:
# Data manipulation: get counts of reports by neighborhood
def plot_value_counts(value_counts, tag=""):
    x_pos = np.arange(len(value_counts.keys()))
    plt.bar(x_pos, list(value_counts[key] for key in value_counts.keys()), align='center',
            color='green', ecolor='black')
    plt.xticks(x_pos, value_counts.keys(), rotation='vertical')
    plt.xlabel("Neighborhood")
    plt.ylabel("Number of 311 Reports Divided By Population")
    plt.title("Region Correlated With Frequency of 311 Reports" + ("" if len(tag) == 0 else ": {}".format(tag)))
    plt.show()
    
def plot_reports_by_region(df, tag=""):
    new_df = df.groupby(by="neighborhood").aggregate({'case_title': 'count',
                                                      'population': 'max'})

    plot_value_counts((new_df['case_title'] / new_df['population']).sort_values(ascending=False), tag=tag)
In [24]:
plot_reports_by_region(df)

The bar graph above shows the frequency of 311 requests made per region in Boston, normalized by population in thousands. The first time we generated the graph, we did not normalize the data, so a populous region like Dorchester had the highest frequency of 311 reports. Without normalization there is little we can analyze from the graph, so we went ahead and divided the report frequency by the population of each neighborhood.

Now looking at the normalized data, we see that Boston proper has the most 311 reports overall. The population figure we used for Boston isn't very accurate, however (we couldn't find the exact number, so we approximated it by looking at population data from the other regions), so we take Boston's values with a grain of salt. West Roxbury follows with the second-highest number of reports, and Fenway/Kenmore has the lowest. With the data normalized, Dorchester is no longer the neighborhood that produces the most 311 requests.

In [25]:
# Three different bar graphs showing the frequency of 311 reports in each region of Boston
# by three different extreme weather conditions: snowy, windy, and chilly.
plot_reports_by_region(df[snowy], tag="Snowy")
plot_reports_by_region(df[windy], tag="Windy")
plot_reports_by_region(df[chilly], tag="Chilly")

We correlated different types of 311 reports with frequency of requests and region in Boston. West Roxbury makes the most requests related to snowy and windy weather, while Charlestown makes the most requests concerning freezing temperatures. Fenway/Kenmore remains the region that makes the fewest 311 reports across all conditions. We wondered whether there was a reason Fenway/Kenmore had the fewest 311 requests, so we took a look at its income data and found that its income level was the second-lowest of all the regions of Boston. According to Google, Fenway/Kenmore is also known to be quite poverty-prone. Curious about how income relates to the frequency of 311 requests, we made the scatterplots below.

311 Reports by Median Household Income

After looking at which regions have the most 311 requests, we looked at the median household income of each of those regions to try to find a correlation between median household income and frequency of 311 requests.

In [26]:
def plot_requests_by_income(df, tag=""):
    # Data manipulation: get counts of reports by median household income;
    # 'min'/'max' just carry the neighborhood name and population through the groupby.
    new_df = df.groupby('median household income').aggregate(
        {'neighborhood': 'min', 'case_title': 'count', 'population': 'max'})
    value_counts = new_df['case_title'] / new_df['population']

    fig, ax = plt.subplots()

    x = [int(key) for key in value_counts.keys()]
    y = list(value_counts.values)

    # Plot how many of our reports come from low vs high income areas
    plt.xlabel("Income ($)")
    plt.ylabel("Number of 311 Requests Divided By Population")
    plt.title("Income Correlated With Number of 311 Reports"
              + ("" if len(tag) == 0 else ":\n{}".format(tag)))

    # Fit and draw a least-squares line through the scatter
    fit = np.polyfit(x, y, 1)
    fit_fn = np.poly1d(fit)
    plt.plot(x, y, 'yo', x, fit_fn(x), '--k')

    # Label each point with its neighborhood name
    for i, txt in enumerate(value_counts.keys()):
        ax.annotate(df[df['median household income'] == txt]['neighborhood'].min(),
                    (x[i], y[i]))

    print("rvalue:", stats.linregress(x, y).rvalue)

    plt.show()
In [27]:
plot_requests_by_income(df)
rvalue: 0.5531624183339354

The scatterplot above has been normalized by dividing the number of 311 requests by the population of each region. We do in fact see a trend where lower-income regions make fewer 311 requests, while higher-income regions make more. The calculated r-value is 0.55, a moderate positive correlation: unlikely to be pure chance, but far from conclusive (r² ≈ 0.31, so income is associated with only about a third of the variance). When we first started this project, we hypothesized that the poorer regions would make more 311 requests due to less adequate living conditions; the data argues otherwise, as richer regions are more verbose in filing 311 requests. We thought of several possible reasons. Perhaps people from more poverty-prone regions are too busy with everyday survival to pay much attention to the surrounding environment. Maybe some do not have ready access to the internet or a phone. It is also possible that richer people are simply pickier about their living conditions and how well-maintained their neighborhood is. We then took a look at the relationship between types of 311 requests and income per region.
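The "by chance" question can actually be quantified: `scipy.stats.linregress` also returns a p-value, the probability of seeing a correlation at least this strong if there were no real relationship. A self-contained sketch on synthetic data (the numbers below are made up for illustration, not our income data):

```python
import numpy as np
from scipy import stats

# Synthetic income / report-rate pairs with a mild upward trend plus noise.
rng = np.random.default_rng(0)
income = np.linspace(30_000, 120_000, 15)
reports = 0.5 + income / 200_000 + rng.normal(0, 0.1, income.size)

result = stats.linregress(income, reports)
# result.rvalue measures the strength of the linear relationship;
# result.pvalue estimates how likely such an rvalue is under pure chance.
print("rvalue:", round(result.rvalue, 2), "pvalue:", result.pvalue)
```

A small p-value (conventionally below 0.05) is the usual criterion for calling a correlation above chance, regardless of how large the r-value itself is.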

In [28]:
plot_requests_by_income(df[df['reason'] == 'Street Cleaning'])
rvalue: 0.46134182984096306

Diving deeper into the details, let's look at a specific reason for filing 311 requests; in particular, those related to street cleaning. This scatterplot looks nearly identical to the general scatterplot above; however, the r-value has decreased to a mere 0.46, so the correlation is now weaker still.

In [29]:
plot_requests_by_income(df[snowy])
rvalue: 0.5850983356962013

We will now look only at 311 requests made during snowy weather. The r-value is 0.59, so the correlation is weak, yet above chance. The scatterplot shows essentially the same trend as those above, where regions with less income make fewer 311 requests. Some major complaints during heavy snowfall relate to snow shoveling, abandoned vehicles, and parking enforcement. For one thing, poorer regions may have fewer problems with vehicles, given that it is usually cheaper to use public transportation than to own and maintain a car. If our assumption is correct and regions with lower income do have fewer car owners, then it makes more sense that there would be fewer reports about abandoned vehicles and parking enforcement.

In [30]:
plot_requests_by_income(df[chilly][df[chilly]['reason'] == 'Housing'], tag="Housing-Related Requests on Chilly Days")
rvalue: -0.3535191310334862

This scatterplot looks specifically at the number of 311 requests related to housing during freezing temperatures. Here we can see that, for once, more reports come from lower-income regions. This trend may be due to the worse living and housing conditions of those in poverty. Since one would expect someone living in a wealthier neighborhood to have better heating, ventilation, and sturdier housing in general, these people may suffer fewer setbacks from the cold than those whose housing is not quite as decent. However, we have to keep in mind that the r-value is only -0.35, so the correlation is rather weak.

Conclusion

Our most general question of this project was to see the relationship between 311 requests and the weather in Boston. We then made the question a little more specific by looking into the frequency and types of 311 requests made in correlation with weather and region in Boston.

The first section of our Jupyter Notebook looks into how the presence of snowfall is related to the types of 311 requests made. We counted the number of requests that were explicitly related or unrelated to snow against whether there was actually snowfall. Using two different datasets (the first covering all seasons from 2011 to 2018 and another covering only the winter of 2014 to 2015), we were able to see how the severity of snowfall affected what types of 311 requests were made. With heavy snowfall came more 311 requests strongly related to snow, such as requests for “snow plowing” and “street cleaning”. We can see this relationship in the word clouds made in the next section of the notebook, where we were able to see the different types of 311 requests made under various kinds of extreme weather.

Taking a look at the word cloud titled “Snowy”, we can see which 311 requests are most frequently made when there are more than two inches of snowfall. Unsurprisingly, most requests asked for snow plowing, maintenance of streets and highways, clearance of abandoned vehicles, and help with heat problems. Snowy weather seemed to have a link with chilly weather of under negative 20 degrees Celsius: if you look at the word cloud for “Chilly”, you can see some of the requests that also appeared under “Snowy”, including requests for “street cleaning/snow plowing”, maintenance of highways/streets/sidewalks, “abandoned vehicles”, and various heat problems. One more weather condition related to the cold is windiness, and looking at the word cloud titled “Windy”, we see lots of “tree emergencies”, “street cleaning”, and problems with lights, wires, signs, and traffic signals.

Overall, with extreme weather relating to cold (snow, wind, and under freezing temperatures), Boston has problems relating to the maintenance of streets and highways, inadequate heating conditions, and abundance of abandoned vehicles.

We then looked into which regions of Boston generate the most 311 requests. Boston proper had the most 311 reports overall. West Roxbury came next with the second-highest number of reports, and Fenway/Kenmore had the lowest. In fact, Fenway/Kenmore consistently had the fewest 311 requests regardless of which type of 311 report we were looking at, which led us to look into income per region to see whether it correlated with the frequency of 311 requests.

We drew scatterplots to see the relationship between income and number of 311 requests. All of the scatterplots showed weak correlations, though most were still above chance, with lower-median-income regions making fewer 311 requests.

The main takeaway from this project is the fact that both weather patterns and income have some correlation with the types of 311 requests received. This is altogether unsurprising; however, given what we found, the City of Boston can continue to prepare itself to handle tree issues when it's windy and build-up of snow on streets when it's snowy.

Group Roles

All three of us contributed evenly to finding datasets to use. Wanqi translated the online data of income by region into CSV format, and Matthew worked on combining the datasets into the final CSV file we used for this assignment. Most of the coding and visualizations were done by Matthew, including adding the necessary columns to the dataset, creating graphs, and generating word clouds. Jin took care of most of the data analysis, including commenting on code to help audiences better understand what it does, and writing interpretations of what can be observed from the visualizations. Wanqi wrote the introduction for the assignment, and Jin wrote the conclusion.